ABSTRACT

Motivation Many computerized methods for RNA-RNA interaction
structure prediction have been developed. Recently, $O(N^6)$ time
and $O(N^4)$ space dynamic programming algorithms have become
available that compute the partition function of RNA-RNA interaction
complexes. However, few of these methods incorporate knowledge
of related sequences, so relevant evolutionary information is often
neglected in structure determination. It is therefore of considerable
practical interest to introduce a method that takes both
thermodynamic stability and sequence covariation into account.

Results We present the a priori folding algorithm ripalign,
whose input consists of two (given) multiple sequence alignments
(MSA). ripalign outputs (1) the partition function, (2) base-pairing
probabilities, (3) hybrid probabilities and (4) a set of
Boltzmann-sampled suboptimal structures consisting of canonical joint
structures that are compatible with the alignments. Compared to the
single sequence-pair folding algorithm rip, ripalign requires negligible
additional memory resources. Furthermore, we incorporate possible
structure constraints as input parameters into our algorithm.
Availability The algorithm described here is implemented in
C as part of the rip package. The supplemental material,
source code and input/output files can freely be downloaded from
http://www.combinatorics.cn/cbpc/ripalign.html
Contact Christian Reidys duck@santafe.edu



Keywords multiple sequence alignment, RNA-RNA interaction, 
joint structure, dynamic programming, partition function, base 
pairing probability, hybrid, loop, RNA secondary structure. 



1 INTRODUCTION 

RNA-RNA interactions play a major role at many different
levels of the cellular metabolism such as plasmid replication
control, viral encapsidation, or transcriptional and translational
regulation. With the discovery that a large number of transcripts
in higher eukaryotes are noncoding RNAs, RNA-RNA interactions
in cellular metabolism are gaining in prominence. Typical
examples of interactions involving two RNA molecules are snRNAs
(Forne et al., 1996); snoRNAs with their targets (Bachellerie et al.,
2002); micro-RNAs from the RNAi pathway with their mRNA
target (Ambros, 2004; Murchison and Hannon, 2004); sRNAs from
Escherichia coli (Hershberg et al.; Repoila et al.); and
sRNA loop-loop interactions (Brunel et al., 2003). The common
feature in many ncRNA classes, especially prokaryotic small RNAs,
is the formation of RNA-RNA interaction structures that are much
more complex than simple sense-antisense interactions.

As is the case for the general RNA folding problem
with unrestricted pseudoknots (Akutsu, 2000), the RNA-RNA
interaction problem (RIP) is NP-complete in its most general
form (Alkan et al., 2006; Mneimneh, 2009). However, polynomial-time
algorithms can be derived by restricting the space of
allowed configurations in ways that are similar to pseudoknot
folding algorithms (Rivas and Eddy, 1999). The simplest approach
concatenates the two interacting sequences and subsequently
employs a slightly modified standard secondary structure folding
algorithm. The algorithms RNAcofold (Hofacker et al., 1994;
Bernhart et al., 2006), pairfold (Andronescu et al., 2005), and
NUPACK (Ren et al., 2005) subscribe to this strategy. A major
shortcoming of this approach is that it cannot predict important
motifs such as kissing-hairpin loops. The paradigm of concatenation
has also been generalized to the pseudoknot folding algorithm of
Rivas and Eddy (1999). The resulting model, however, still does not
generate all relevant interaction structures (Chitsaz et al., 2009b).
An alternative line of thought is to neglect all internal base-pairings
in either strand and to compute the minimum free energy (MFE)
secondary structure for their hybridization under this constraint. For
instance, RNAduplex and RNAhybrid (Rehmsmeier et al.)
follow this line of thought. RNAup (Mückstein et al., 2006)
and IntaRNA (Busch et al.) restrict interactions to a single
interval that remains unpaired in the secondary structure for each
partner. These models have proved particularly useful for bacterial
sRNA/mRNA interactions (Geissmann and Touati, 2004).

Pervouchine (2004) and Alkan et al. (2006) independently
proposed MFE folding algorithms for predicting the joint
structure of two interacting RNA molecules with polynomial time


© Oxford University Press 2010. 






A.X. Li, M. Marz, J. Qin, CM. Reidys 



complexity. In their model, a "joint structure" means that the
intramolecular structures of each molecule are pseudoknot-free, the
intermolecular base pairs are noncrossing, and there are no
so-called "zig-zags"; see the supplementary material (SM) for a
detailed definition. The optimal joint structure is computed in
$O(N^6)$ time and $O(N^4)$ space via a dynamic programming (DP)
routine.

A more reliable approach is to consider the partition function,
which by construction integrates over the Boltzmann-weighted
probability space, allowing for the derivation of thermodynamic
quantities such as equilibrium concentration, melting temperature
and base-pairing probabilities. The partition function of joint
structures was independently derived by Chitsaz et al. (2009b) and
Huang et al. (2009), while the base-pairing probabilities are due to
Huang et al. (2009).

A key quantity here is the probability of hybrids, which cannot
be recovered from base-pairing probabilities since the latter can
be highly correlated. Huang et al. (2010) presented a new
hybrid-based decomposition grammar, facilitating the computation of
the nontrivial hybrid probabilities as well as the Boltzmann sampling
of RNA-RNA interaction structures. The partition function of
joint structures can be computed in $O(N^6)$ time and $O(N^4)$
space, and current implementations require very large computational
resources. Salari et al. (2009) recently achieved a substantial
speed-up by making use of the observation that the external
interactions mostly occur between pairs of unpaired regions of single
structures. Chitsaz et al. (2009a) introduced tree-structured Markov
Random Fields to approximate the joint probability distribution of
multiple (> 3) contact regions.

Unfortunately, incompleteness of the underlying energy model, 
in particular for hybrid- and kissing-loops, may result in prediction 
inaccuracy. One way of improving this situation is to involve 
phylogenetic information of multiple sequence alignments (MSA). 

In an MSA homologous nucleotides are grouped in columns,
where homologous is interpreted in both a structural and an
evolutionary sense: the nucleotides of a column occupy
similar structural positions and all diverge from a common
ancestral nucleotide. Also, many ncRNAs show clear signs of
undergoing compensatory mutations along evolutionary trajectories.
In conclusion, it seems reasonable to stipulate that a non-negligible
part of the existing RNA-RNA interactions contain preserved
but covarying patterns of the interactions (Seemann et al., 2010).
Therefore we can associate a consensus interaction structure to pairs
of interacting MSAs (see Section 2.1).

Along these lines Seemann et al. (2010) presented the algorithm
PETcofold for the prediction of RNA-RNA interactions including
pseudoknots in given MSAs. Their algorithm is an extension of
PETfold (Seemann et al., 2008) using elements of RNAcofold
(Bernhart et al., 2006) and computational strategies for hierarchical
folding (Gaspin and Westhof, 1995; Jabbari et al., 2007). However,
PETcofold is an approximation algorithm and further differences
between the two approaches will be discussed in Section ??.

Here, we present the algorithm ripalign which computes the 
partition function, base-pairing as well as hybrid probabilities and 
performs Boltzmann-sampling on the level of MSAs. ripalign 
represents a generalization of rip to pairs of interacting MSAs 
and a new grammar of canonical interaction structures. The latter 
is of relevance since there are no isolated base pairs in molecular 
complexes. 



Table 1. Preprocessing in ripalign: Given a pair of MSAs (R, S),
where R consists of three aligned RNA sequences of species (sp.)
$\theta_1$ or $\theta_2$. S in turn consists of four aligned sequences
of species $\theta_1$ and $\theta_2$. Then we obtain the matrix-pair
(R, S), where $(R^i, S^i)$, $1 \le i \le 6$, ranges over all the six
potentially interacting RNA-pairs.



One important step consists in identifying the notion of a joint
structure compatible with a pair of interacting MSAs. Our notion
is based on the framework of Hofacker et al. (2002), where a
sophisticated cost function capturing thermodynamic stability as
well as sequence covariation is employed. Furthermore, ripalign
is tailored to take structure constraints, such as blocked nucleotides
known e.g. from chemical probing, into account.



2 THEORY 

2.1 Multiple sequence alignments and compatibility 

An MSA $R$ consists of $m_R$ RNA sequences of known species. Denoting
the length of the aligned sequences by $N$, $R$ constitutes an
$m_R \times N$ matrix, having 5'-3' oriented rows, $R^i$, and columns,
$R_j$. Its $(i, j)$-th entry, $R^i_j$, is a nucleotide, A, U, G, C, or
a gap, denoted by ".".

For any pair $(R, S)$ we assume that $S$ is an $m_S \times M$ matrix,
whose rows carry 3'-5' orientation.

In the following we shall assume that a pair of RNA sequences can only
interact if they belong to the same species. A pair $(R, S)$ can
interact if for any row $R^i$ there exists at least one row in $S$ that
can interact with $R^i$.

Given a pair of interacting MSAs $(R, S)$, let $m$ be the total number
of potentially interacting pairs. ripalign performs a pre-processing
step which generates an $m \times N$ matrix $R$ and an $m \times M$
matrix $S$ such that $(R^i, S^i)$ ranges over all $m$ potentially
interacting RNA-pairs, see Tab. 1 and the SM, Section 1.2.

We shall refer in the following to R and S as MSAs ignoring the fact that 
they have multiple sequences. 
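
The pre-processing step can be sketched in Python as follows. This is a minimal illustration and not the C implementation of ripalign: the function names `can_interact` and `expand_pairs`, the representation of an MSA as a list of (species, aligned sequence) tuples, and the toy sequences are assumptions made for this example. The species assignment is one possibility consistent with Table 1 (three rows in R, four rows in S, six same-species pairs).

```python
from itertools import product


def can_interact(r_rows, s_rows):
    """True if every row of R has at least one same-species row in S."""
    s_species = {species for species, _ in s_rows}
    return all(species in s_species for species, _ in r_rows)


def expand_pairs(r_rows, s_rows):
    """Return expanded alignments whose i-th rows form the i-th
    potentially interacting (same-species) sequence pair."""
    pairs = [
        (r_seq, s_seq)
        for (r_sp, r_seq), (s_sp, s_seq) in product(r_rows, s_rows)
        if r_sp == s_sp
    ]
    r_expanded = [r_seq for r_seq, _ in pairs]
    s_expanded = [s_seq for _, s_seq in pairs]
    return r_expanded, s_expanded


# Hypothetical toy alignments; "." marks a gap.
R = [("theta1", "GCA.U"), ("theta1", "GCAAU"), ("theta2", "GUA.U")]
S = [("theta1", "AUGC"), ("theta1", "AUGU"),
     ("theta2", "A.GC"), ("theta2", "AUGC")]

R_exp, S_exp = expand_pairs(R, S)
print(can_interact(R, S), len(R_exp), len(S_exp))
```

With this assignment the two expanded alignments have m = 6 rows each (2 x 2 pairs for the first species plus 1 x 2 pairs for the second), so all later column statistics run over six sequence pairs.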

We proceed by defining joint structures that are compatible with a
fixed $(R, S)$. To this end, let us briefly review some concepts
introduced in Huang et al. (2009).

A joint structure $J(R, S, I)$ is a graph consisting of
(j1) Two secondary structures $R$ and $S$, whose backbones are drawn as
horizontal lines on top of each other and whose arcs are drawn in the
upper and lower halfplane, respectively. We consider $R$ over a 5' to 3'
oriented backbone $(R_1, \dots, R_N)$ and $S$ over a 3' to 5' oriented
backbone $(S_1, \dots, S_M)$ and refer to any $R$- and $S$-arcs as
interior arcs.
(j2) An additional set $I$ of noncrossing arcs of the form $R_iS_j$
(exterior arcs), where $R_i$ and $S_j$ are unpaired in $R$ and $S$.
(j3) $J(R, S, I)$ contains no "zig-zags" (see SM).

The subgraph of a joint structure $J(R, S, I)$ induced by a pair of
subsequences $(R_i, R_{i+1}, \dots, R_j)$ and
$(S_h, S_{h+1}, \dots, S_\ell)$ is denoted by $J_{i,j;h,\ell}$. In
particular, $J(R, S, I) = J_{1,N;1,M}$ and
$J_{i,j;h,\ell} \subset J_{a,b;c,d}$ if and only if $J_{i,j;h,\ell}$ is
a subgraph of $J_{a,b;c,d}$ induced by $(R_i, \dots, R_j)$ and
$(S_h, \dots, S_\ell)$. In particular, we use $S[i, j]$ to denote the
subgraph of $J_{1,N;1,M}$ induced by $(S_i, S_{i+1}, \dots, S_j)$,
where $S[i, i] = S_i$ and $S[i, i-1] = \emptyset$.

Fig. 1. The four basic types of tight structures are given as follows:
$\circ$: $\{R_iS_h\} = J_{i,j;h,\ell}$ and $i = j$, $h = \ell$;
$\nabla$: $R_iR_j \in J_{i,j;h,\ell}$ and $S_hS_\ell \notin J_{i,j;h,\ell}$;
$\square$: $\{R_iR_j, S_hS_\ell\} \subset J_{i,j;h,\ell}$;
$\triangle$: $S_hS_\ell \in J_{i,j;h,\ell}$ and $R_iR_j \notin J_{i,j;h,\ell}$.



Given a joint structure $J_{a,b;c,d}$, a tight structure (TS),
$J_{i,j;h,\ell}$ (Huang et al., 2009), is a specific subgraph of
$J_{a,b;c,d}$ indexed by its type
$\in \{\circ, \nabla, \square, \triangle\}$, see Fig. 1. For instance,
we use $J^{\nabla}_{i,j;h,\ell}$ to denote a TS of type $\nabla$.

A hybrid is a joint structure $J^{Hy}_{i_1,i_\ell;j_1,j_\ell}$, i.e. a
maximal sequence of intermolecular interior loops consisting of a set of
exterior arcs $(R_{i_1}S_{j_1}, \dots, R_{i_\ell}S_{j_\ell})$ where
$R_{i_h}S_{j_h}$ is nested within $R_{i_{h+1}}S_{j_{h+1}}$ and
where the internal segments $R[i_h + 1, i_{h+1} - 1]$ and
$S[j_h + 1, j_{h+1} - 1]$ consist of single-stranded nucleotides only.
That is, a hybrid is the maximal unbranched stem-loop formed by
external arcs.

A joint structure $J(R, S, I)$ is called canonical if and only if:
(c1) each stack in the secondary structures $R$ and $S$ is of size at
least two, i.e. there exist no isolated interior arcs,
(c2) each hybrid contains at least two exterior arcs.
In the following, we always assume a joint structure to be canonical.
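
Conditions (c1) and (c2) can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: arcs are given as tuples of 1-based alignment positions, S is indexed along its drawn backbone so that noncrossing exterior arcs increase in both coordinates, and the "zig-zag" condition (j3) is not checked because its definition is given in the SM. The function names are hypothetical and not part of ripalign.

```python
def has_isolated_arc(arcs):
    """(c1) fails if some interior arc (i, j) has no stacked neighbour."""
    arc_set = set(arcs)
    return any(
        (i - 1, j + 1) not in arc_set and (i + 1, j - 1) not in arc_set
        for i, j in arc_set
    )


def hybrids(r_arcs, s_arcs, ext_arcs):
    """Group exterior arcs into hybrids: consecutive exterior arcs belong
    to the same hybrid if the segments between them are single-stranded."""
    paired_r = {p for arc in r_arcs for p in arc}
    paired_s = {p for arc in s_arcs for p in arc}
    groups = []
    for i, j in sorted(ext_arcs):
        if groups:
            pi, pj = groups[-1][-1]
            unpaired_between = (
                not any(k in paired_r for k in range(pi + 1, i))
                and not any(k in paired_s for k in range(pj + 1, j))
            )
            if j > pj and unpaired_between:
                groups[-1].append((i, j))
                continue
        groups.append([(i, j)])
    return groups


def is_canonical(r_arcs, s_arcs, ext_arcs):
    """True if (c1) holds for R and S and (c2) holds for every hybrid."""
    return (
        not has_isolated_arc(r_arcs)
        and not has_isolated_arc(s_arcs)
        and all(len(h) >= 2 for h in hybrids(r_arcs, s_arcs, ext_arcs))
    )


# Toy structure: one stack of size two in R and one hybrid with two arcs.
print(is_canonical([(1, 10), (2, 9)], [], [(4, 3), (5, 4)]))
```

A structure containing a single exterior arc, or a single interior arc without a stacked neighbour, is rejected by `is_canonical`.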

Next, we come to $(R, S)$-compatible joint structures. In contrast to
single-sequence compatibility, this notion involves statistical
information of the MSAs.

The key point consists in specifying under which conditions two vertices
contained in $(R_1, \dots, R_N, S_1, \dots, S_M)$ can pair. This is
obtained by a generalization of the RNAalifold approach (Hofacker et
al., 2002). We specify these conditions for interior pairs,
$(c^{R}_{i,j})$ and $(c^{S}_{i,j})$, and exterior pairs,
$(c^{R,S}_{i,j})$, in eq. (2.3)-(2.5).

For interior arcs $(R_i, R_j)$, let $X, Y \in \{A, U, G, C\}$. Let
$f^{R}_{i,j}(XY)$ be the frequency with which $(X, Y)$ occurs as a
row-vector in the 2-column sub-matrix $(R_i, R_j)$, and

$$C^{R}_{i,j} = \sum_{XY, X'Y'} f^{R}_{i,j}(XY)\, D^{R}_{XY,X'Y'}\, f^{R}_{i,j}(X'Y'). \qquad (2.1)$$

Here $XY$ and $X'Y'$ independently range over all 16 elements of
$\{A,U,G,C\} \times \{A,U,G,C\}$ and
$D^{R}_{XY,X'Y'} = d_H(XY, X'Y')$, i.e. the Hamming distance between
$XY$ and $X'Y'$, in case both $XY$ and $X'Y'$ are Watson-Crick or GU
wobble base pairs, and 0 otherwise. Furthermore, we introduce
$q^{R}_{i,j}$ to deal with the inconsistent sequences:

$$q^{R}_{i,j} = 1 - \frac{1}{m} \sum_{h} \left\{ \Pi^{h}_{i,j}(R) + \delta(R^{h}_{i}, \mathrm{gap})\, \delta(R^{h}_{j}, \mathrm{gap}) \right\}, \qquad (2.2)$$

where $\delta(x, y)$ is the Kronecker delta and $\Pi^{h}_{i,j}(R)$ is
equal to 1 if $R^{h}_{i}$ and $R^{h}_{j}$ form a Watson-Crick or GU
wobble base pair and 0 otherwise. Now we obtain
$B^{R}_{i,j} = C^{R}_{i,j} - \phi_1 q^{R}_{i,j}$. Based on sequence
data, the threshold for pairing $B^{R}_{\ast}$ as well as the weight of
inconsistent sequences $\phi_1$ are computed, and we have

$$(c^{R}_{i,j}) \quad B^{R}_{i,j} \ge B^{R}_{\ast}. \qquad (2.3)$$

The case of two positions $S_i$ and $S_j$ is completely analogous:

$$(c^{S}_{i,j}) \quad B^{S}_{i,j} \ge B^{S}_{\ast}, \qquad (2.4)$$

where $B^{S}_{i,j}$ and $B^{S}_{\ast}$ are analogously defined.

Fig. 2. Interior loop energy: An interior loop formed by $R_iR_j$ and
$R_hR_\ell$, where $i < h < \ell < j$ are the alignment positions. Grey
bands are used to denote the positions we omit between the segments
$(i, h)$, $(h, \ell)$ and $(\ell, j)$.

As for $(c^{R,S}_{i,j})$ a further observation factors in: since many
ncRNAs show clear signs of undergoing compensatory mutations in the
course of evolution (Seemann et al., 2010; Marz et al., 2008), we
postulate the existence of a non-negligible amount of RNA-RNA
interactions containing conserved pairs, consistent mutations,
compensatory mutations as well as inconsistent mutations. Based on this
observation we arrive at

$$(c^{R,S}_{i,j}) \quad B^{R,S}_{i,j} \ge B^{R,S}_{\ast}, \qquad (2.5)$$

where $B^{R,S}_{i,j}$ and $B^{R,S}_{\ast}$ are defined analogously to
$B^{R}_{i,j}$ and $B^{R}_{\ast}$.

A joint structure $J$ is compatible with $(R, S)$ if for any $J$-arc,
the corresponding intra- or inter-positions can pair according to
eq. (2.3)-(2.5).
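
The pairing conditions can be illustrated with a short Python sketch. It assumes the RNAalifold-style definitions: the covariation term sums frequency-weighted Hamming distances over pairs of canonical base pairs, and the inconsistency term counts rows that are neither a canonical pair nor a double gap. The function names, the gap symbol ".", the default weight `phi1 = 1.0` and the threshold in `can_pair` are hypothetical choices for this example, not the values computed by ripalign.

```python
from collections import Counter
from itertools import product

CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick and GU wobble
GAP = "."


def hamming(xy, xz):
    """Hamming distance between two dinucleotide strings."""
    return sum(a != b for a, b in zip(xy, xz))


def covariation(col_i, col_j):
    """Covariation term C: only canonical pairs contribute (D = 0 otherwise)."""
    m = len(col_i)
    counts = Counter(a + b for a, b in zip(col_i, col_j))
    freq = {xy: counts[xy] / m for xy in CANONICAL}
    return sum(
        freq[xy] * hamming(xy, xz) * freq[xz]
        for xy, xz in product(CANONICAL, repeat=2)
    )


def inconsistency(col_i, col_j):
    """Term q: fraction of rows that are neither a pair nor a double gap."""
    m = len(col_i)
    consistent = sum(
        (a + b in CANONICAL) or (a == GAP and b == GAP)
        for a, b in zip(col_i, col_j)
    )
    return 1.0 - consistent / m


def pairing_score(col_i, col_j, phi1=1.0):
    """Score B = C - phi1 * q for two alignment columns."""
    return covariation(col_i, col_j) - phi1 * inconsistency(col_i, col_j)


def can_pair(col_i, col_j, threshold, phi1=1.0):
    """Pairing condition B >= threshold."""
    return pairing_score(col_i, col_j, phi1) >= threshold


# Three rows forming GC, GC and GU: a consistent mutation, no inconsistency.
print(pairing_score("GGG", "CCU"))
```

For the toy columns the covariation term is 2 * (2/3) * 1 * (1/3) = 4/9 and the inconsistency term is 0, so the score is about 0.444. For exterior pairs the same functions would be applied to one column of each expanded alignment.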



2.2 Energy model 

According to Huang et al. (2009), joint structures can be decomposed
into disjoint loops. These loop-types include standard hairpin-, bulge-,
interior- and multi-loops found in RNA secondary structures as well as
hybrid and kissing-loops. Following the energy parameter rules of
Mathews et al. (1999), the energy of each loop can be obtained as a sum
of the energies associated with non-terminal symbols, i.e. graph
properties (sequence independent), and additional contributions which
depend uniquely on the terminal bases (sequence dependent).

Suppose we are given a joint structure $J$, compatible with a pair
$T = (R, S)$. Let $L \in J$ be a loop and let $G_{L,i}$ represent the
loop energy of the $i$-th interaction-pair $(R^i, S^i)$. Then the loop
energy of $T$ is

$$G_{L,T} = \frac{1}{m} \sum_{i=1}^{m} G_{L,i}. \qquad (2.6)$$

We consider the energy of the structure as the sum of all loop
contributions:

$$G_{J} = \sum_{L \in J} G_{L,T}. \qquad (2.7)$$
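
As a small numerical illustration, the following Python sketch assumes, as in RNAalifold, that the loop energy of the alignment pair is the average of the per-pair loop energies and that the structure energy is the sum over loops. The function names and the energy values (in kcal/mol) are hypothetical.

```python
def alignment_loop_energy(per_pair_energies):
    """Average the loop energy over the m interaction pairs (assumption)."""
    return sum(per_pair_energies) / len(per_pair_energies)


def structure_energy(loops):
    """Sum the averaged loop energies over all loops of the structure."""
    return sum(alignment_loop_energy(energies) for energies in loops)


# Two loops, each evaluated on m = 3 interaction pairs (made-up values).
loops = [
    [-2.0, -1.0, 0.0],  # e.g. a stacking loop; the third pair contributes 0
    [1.0, 1.0, 1.0],    # e.g. a loop with sequence-independent energy only
]
print(structure_energy(loops))
```

Here the first loop averages to -1.0 and the second to 1.0, so the toy structure has energy 0.0.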

To save computational resources, gaps are treated as bases in ripalign.
Thus only alignment positions contribute as indices and loop sizes. Since
no measured energy parameters for nonstandard base-pairs are available at
present, additional terminal-dependent contributions for the latter are
ignored. For instance, let $\mathrm{Int}_{i,j;h,\ell}$ denote an interior
loop formed by $R_iR_j$ and $R_hR_\ell$ and let
$G^{\mathrm{Int},T}_{i,j;h,\ell}$ denote the free energy of
$\mathrm{Int}_{i,j;h,\ell}$ with respect to the aligned sequences in $T$.
Then $G^{\mathrm{Int},T}_{i,j;h,\ell}$ associated to the three aligned
subsequences of Fig. 2 reads

$$G^{\mathrm{Int},T}_{i,j;h,\ell} = G^{\mathrm{Int}}_{i,j;h,\ell} + \frac{1}{3}\left( G^{\mathrm{Int}}_{\ast,G,C;G,C} + G^{\mathrm{Int}}_{\ast,G,U;G,U} + G^{\mathrm{Int}}_{\ast,G,C;\mathrm{gap},\mathrm{gap}} \right). \qquad (2.8)$$

Here $G^{\mathrm{Int}}_{i,j;h,\ell}$ represents contributions related
exclusively to the positions of the interior loop while
$G^{\mathrm{Int}}_{\ast,X,Y;X',Y'}$ represents additional contributions
related to the specific nucleotides which form the interior loop. We set
$G^{\mathrm{Int}}_{\ast,G,C;\mathrm{gap},\mathrm{gap}}$ to be zero.

2.3 The grammar of canonical joint structures and the 
partition function 

The partition function algorithm is easily extended to work with the
modified energy functions given in eq. (2.7). The reformulation of the
original hybrid-grammar into a grammar of canonical joint structures
represents, already for single interaction pairs, a significant
improvement in prediction quality. The original rip-grammar would
oftentimes encounter joint structures having a hybrid composed of a
single isolated exterior arc, see Fig. ??.

In order to decompose canonical joint structures via the unambiguous
grammar introduced in Section 2.3, we distinguish the two types (Type cc
and Type c) of TS's of type $\nabla$, $\triangle$ or $\square$. Given a
TS of type $\nabla$, denoted by $J^{\nabla}_{i,j;h,\ell}$, we write
$J^{\nabla,cc}_{i,j;h,\ell}$ and $J^{\nabla,c}_{i,j;h,\ell}$ depending
on whether $R_{i+1}R_{j-1} \in J^{\nabla}_{i,j;h,\ell}$ or not,
respectively. Analogously, we define $J^{\square,cc}_{i,j;h,\ell}$,
$J^{\square,c}_{i,j;h,\ell}$ and $J^{\triangle,cc}_{i,j;h,\ell}$,
$J^{\triangle,c}_{i,j;h,\ell}$, see Fig. 3.

Fig. 4 summarizes the two basic steps of the canonical grammar: (I)
interior arc-removal to reduce TS, and (II) block-decomposition to split
a joint structure into two smaller blocks. The key feature here is that,
since $J$ is canonical, the smaller blocks are still canonical after
block-decomposition. Each decomposition step displayed in Fig. 4 results
in substructures which eventually break down into generalized loops
whose energies can be directly computed. More details of the
decomposition procedures are described in Section 2 of the SM, where we
prove that for any canonical joint structure $J$, there exists a unique
decomposition-tree (parse-tree), denoted by $T_J$, see Fig. 5.

Fig. 3. Examples of two TS-types. We display $\nabla$, $\square$, or
$\triangle$-tight structures: Type cc (top) and Type c (bottom).



2.4 Probabilities and the Boltzmann Sampling 

A dynamic programming scheme for the computation of a partition
function implies a corresponding computation of probabilities of specific
substructures, obtained "from the outside to the inside", and a
stochastic backtracing procedure that can be used to sample from the
associated distribution (McCaskill, 1990; Ding and Lawrence, 2003;
Huang et al., 2010). We remark that the time complexity does not
increase linearly as a function of m (see SM Table 5).
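
The relation between a partition function recursion and stochastic backtracing can be sketched on a much simpler model. The Python example below is not the ripalign grammar: it assumes a single sequence, pseudoknot-free secondary structures, and a toy energy of `pair_energy` per base pair. All names and parameter values are hypothetical.

```python
import math
import random
from functools import lru_cache

ALLOWED = {"AU", "UA", "GC", "CG", "GU", "UG"}


def make_model(seq, pair_energy=-1.0, kT=1.0, min_loop=3):
    """Return (Z, sample) for a toy single-sequence folding model."""
    weight = math.exp(-pair_energy / kT)  # Boltzmann factor of one base pair

    @lru_cache(maxsize=None)
    def Z(i, j):
        """Partition function of the interval seq[i..j] (inclusive)."""
        if i >= j:
            return 1.0
        total = Z(i + 1, j)  # case 1: position i is unpaired
        for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
            if seq[i] + seq[k] in ALLOWED:
                total += weight * Z(i + 1, k - 1) * Z(k + 1, j)
        return total

    def sample(rng, i=0, j=None):
        """Stochastic backtracing: draw one structure from the ensemble."""
        if j is None:
            j = len(seq) - 1
        if i >= j:
            return []
        r = rng.random() * Z(i, j) - Z(i + 1, j)
        if r < 0:
            return sample(rng, i + 1, j)
        for k in range(i + min_loop + 1, j + 1):
            if seq[i] + seq[k] in ALLOWED:
                r -= weight * Z(i + 1, k - 1) * Z(k + 1, j)
                if r < 0:
                    inner = sample(rng, i + 1, k - 1)
                    return [(i, k)] + inner + sample(rng, k + 1, j)
        return sample(rng, i + 1, j)  # only reached through rounding errors

    return Z, sample


Z, sample = make_model("GGGAAAUCC")
rng = random.Random(0)
print(Z(0, 8))
print(sample(rng))
```

Each backtracing step chooses one case of the recursion with probability proportional to its contribution to Z, so structures are drawn with their Boltzmann probabilities without enumerating the ensemble.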

Along the lines of the design of the Vienna software package
(Hofacker et al., 1994), ripalign now offers the following features as
optional input parameters:

(1) a position i can be restricted to form an interior or an exterior arc. 
(denoted by "— " and respectively); 

(2) a position i can be forced to be unpaired (denoted by "x"); 

(3) a position i can be restricted to form an (interior or an exterior) arc with 
some position j (denoted by "*"); 

(4) a pair of positions i and j can be forced to form an interior or exterior 
arc (denoted by "( )" or "[]", respectively). 

However, the above features are optional. Thus ripalign can deal with
both scenarios: the absence of any a priori information and the existence
of specific information, e.g. the location of the Sm-binding site, see
Fig. 8.



Fig. 4. Grammar: Illustration of the decomposition of $J_{1,N;1,M}$:
DTS, RTS and hybrids in Procedure (a) and of tight structures in
Procedure (b). In the bottom row the symbols for the 16 distinct types of
structural components are listed: A: arbitrary joint structure
$J_{1,N;1,M}$ (canonical); B: right-tight structure
$J^{RT}_{i,j;r,s}$; C: double-tight structure $J^{DT}_{i,j;r,s}$;
D: tight structure $J^{\nabla}_{i,j;h,\ell}$,
$J^{\square}_{i,j;h,\ell}$ or $J^{\triangle}_{i,j;h,\ell}$; E: hybrid
structure $J^{Hy}_{i,j;h,\ell}$; F: substructure of a hybrid
$J^{h}_{i,j;h,\ell}$ such that $R_iS_j$ and $R_hS_\ell$ are exterior
arcs and $J^{h}_{i,j;h,\ell}$ itself is not a hybrid since it is not
maximal; G, H: maximal secondary structure segments $R[i, j]$,
$S[r, s]$; J: isolated segment $R[i, j]$ or $S[h, \ell]$; K: maximal
secondary structure segments appear in pairs such that at least one of
them is not empty; L: tight structure $J^{\nabla,cc}$; M: tight
structure $J^{\nabla,c}$; N: tight structure $J^{\square,cc}_{i,j;r,s}$;
O: tight structure $J^{\square,c}_{i,j;r,s}$; P: tight structure
$J^{\triangle,cc}$; Q: tight structure $J^{\triangle,c}$.

3 RESULTS AND DISCUSSION 

In this paper we present an a priori $O(N^6)$ time and $O(N^4)$ space
dynamic programming algorithm ripalign, whose input consists of a pair of
interacting MSAs. Compared to rip, ripalign requires only marginally more
computational resources but is, without doubt, still computationally
costly. Approximation algorithms are much faster, for instance PETcofold
(Seemann et al., 2010), having a time complexity of
$O(m (N + M)^3 n)$, where $m$ is the number of sequences in the MSA,
$N$ and $M$ are the sequence lengths of the longer and shorter
alignment, respectively, and $n < N/2$ is the number of iterations for
the adaptation of the threshold value to find likely partial secondary
structures. Their basic assumption is that the two secondary structures
fold independently and that intra-loop evaluation differences are
negligible. The flip-side of reducing the complexity of a folding problem
by introducing additional assumptions is, however, the uncertainty of the
quality of the solution. A case in point here is that the two secondary
structures did not evolve independently, but are rather correlated by
means of their functional interaction. We remark that ripalign (within
its complexity limitations) is capable of describing the space of RNA
interaction structures, for instance via Boltzmann sampling, in detail
and with transparency.

Fig. 5. Example of the parse tree. The parse tree of the canonical joint
structure $J_{1,17;1,9}$.

ripalign represents significant improvements in the following aspects:

(a) we incorporate evolutionary factors into the RNA-RNA interaction 
structure prediction via alignments as input, 

(b) we introduce the grammar of canonical joint structures of interacting- 
alignments, 

(c) we a priori factor in structural-constraints, like for instance, knowledge 
on Sm-binding sites. 

Below we shall discuss (a), (b) and (c) in more detail in the context of
concrete examples. All the MSAs involved in (a), (b) and (c) are listed
in SM, Section 2.

(a): The fhlA/OxyS interaction
The OxyS RNA represses fhlA mRNA translation initiation through
base-pairing with two short sequences (Argaman and Altuvia, 2000), one of
which overlaps the ribosome binding sequence and the other resides
further downstream, within the coding region of fhlA. Our algorithm
correctly predicts both interaction sites based on MSAs, see Fig. 6. In
addition, most predicted stacks in the secondary structures of fhlA and
OxyS agree well with the most frequent Boltzmann sampled structure. Two
more
hybrids, J^g 5 g. 41 44 and 53.43 50 are predicted in our output. The 
two additional contact regions, identified in the partition function,
exhibit a significantly lower probability. An additional hairpin over
R[72, 89], predicted in fhlA instead of the unpaired segment occurring in
the natural structure, can be understood in the context of minimizing
free energy. Comparing the prediction based on the MSAs (Fig. 6, middle)
with the one based on the consensus sequence (Fig. 6, bottom), we
observe:

(1) the secondary structure of fhlA agrees better with the annotated
joint structure (Fig. 6, top),

(2) the leftmost hybrid agrees better with that of the annotated
structure,

(3) the binding-site probability (see SM, Section 5, eq. (5.5)) of the
leftmost hybrid increases by nearly 40%.

On the flip side, due to the gaps in seven out of eight subsequences
induced by R[98, 102] (columns 98-102 in fhlA), the prediction quality of
the right-most hybrid and its corresponding contact-region probability
decreases slightly.

Let us next contrast our results with those of PETcofold, see Fig. 7.
The latter predicts one of the two interaction sites. The second site is
predicted subject to the condition that constrained stems were not
extended (Seemann et al., 2010). It can furthermore be observed that in
order to predict the second hybrid, the secondary structure prediction of
both fhlA and OxyS gets worse. ripalign predicts both interaction sites
situated in fhlA and comes close to predicting the secondary structures
of fhlA as well as OxyS without any additional constraints.
$J^{Hy}_{37,40;79,82}$
$J^{Hy}_{40,41;50,51}$
$J^{Hy}_{5,6;9,10}$
$J^{Hy}_{40,41;50,51}$
$J^{Hy}_{39,40;51,52}$
$J^{Hy}_{76,78;90,92}$
$J^{Hy}_{76,78;90,92}$
$J^{Hy}_{76,78;90,92}$
$J^{Hy}_{37,40;79,82}$
$J^{Hy}_{11,12;9,10}$
$J^{Hy}_{78,80;89,91}$
$J^{Hy}_{16,18;33,35}$
$J^{Hy}_{78,80;89,91}$
$J^{Hy}_{11,12;51,52}$
$J^{Hy}_{54,57;65,68}$
$J^{Hy}_{54,57;65,68}$
$J^{Hy}_{16,17;47,48}$

Table 2. Top 6 probable hybrids predicted by rip and ripalign:

Interaction of two specific RNA molecules, SL1 and SmY-10 of
Caenorhabditis elegans, as illustrated in Fig. 8. The top 6 probable
hybrids predicted by rip as implemented by Huang et al. (2010) are
shown in column I. The hybrids listed in column II are predicted by
ripalign without any structure constraint. The hybrids listed in column
III are predicted by ripalign under the structural constraints that
5'-AAUUUUUG-3' (R[56, 62]) and 3'-GUUUUAA-5' (S[25, 31]) are
Sm-binding sites (colored in red) in SmY-10 and SL-1, respectively. Here,
we use $J^{Hy}_{i,j;h,\ell}$ to denote the hybrid induced by $R[i, j]$
and $S[h, \ell]$.



(b): The SmY-10/SL-1 interaction of C. elegans

MacMorris et al. (2007) stipulated that SmY-10 RNA, possibly involved in
trans-splicing, interacts with the splice leader RNA (SL1 RNA). In Fig. 8
we show that the Sm-binding sites (colored in red) of the RNA molecules
SmY-10 and SL-1 are R[56, 62] and S[25, 31], respectively. In Fig. 8 the
top structure is predicted by rip (Huang et al., 2010). We observe
firstly a stack in SmY-10 consisting of the single arc $R_{24}S_{67}$ and
secondly that the nucleotides of the Sm-binding sites form intramolecular
base pairs. The canonical grammar presented here restricts the
configuration ensemble to canonical joint structures, resulting in the
structure presented in Fig. 8 (middle), in which the peculiar isolated
interaction arc disappears. However, the nucleotides of the Sm-binding
sites still form either intra- or intermolecular base pairs.
Incorporating the structural constraints option we derive the bottom
structure displayed in Fig. 8. Here the Sm-binding sites are
single-stranded. In Table 2 we elaborate this point further and show that
the combination of canonical grammar and structural constraints
eliminates unwanted hybrids and "frees" the nucleotides attributed to
Sm-binding sites from unwanted interactions.

(c): The U4/U6 interaction

Two of the snRNAs involved in pre-mRNA splicing, U4 and U6, are
known to interact by base pairing (Zucker-Aprison et al., 1988). We
divided all known metazoan U4 and U6 snRNAs into three distinct groups
and alignments: protostomia without insects, insects and deuterostomia
(Marz et al., 2008). Marz et al. (2008) observed that insects differ in
their secondary structure from other protostomes, see Fig. 9. Comparing
all the predicted U4/U6 interactions displayed in Fig. 9, we can
conclude:

(1) the secondary partial structures of the U4/U6 complex for all three
groups predicted by ripalign agree predominantly with the described
secondary structures in metazoans (Thomas et al., 1990; Otake et al.,
2002; Shambaugh et al., 1994; Lopez et al., 2008; Shukla et al., 2002),
e.g. as depicted in Fig. 9 (top) for C. elegans (Zucker-Aprison et al.,
1988).

(2) for all three groups, Stem I and II (Fig. 9, top) are highly
conserved. External influences, such as protein interactions, may
stabilize stem II additionally.

(3) for all three groups, the 5' hairpin of U4 snRNA seems highly
conserved to interact with the U6 snRNA. This RNA feature is not fully
understood, since this element is also believed to contain intraloop
interactions and may bind to a 15.5kDa protein (Vidovic et al., 2000).

(4) for all metazoans, the U6 snRNA shows conserved intramolecular
interactions between the 3' part and the region downstream of the
5'-hairpin.

(5) for deuterostomes (Fig. 9, bottom), with a contact-region probability
of 45.5%, our algorithm identifies a third U4/U6 interaction, Stem III,
to be conserved, which agrees with the findings in Jakab et al.
RNA-RNA interaction prediction based on multiple sequence alignments



[Figure 8, panels from top to bottom: rip; ripalign without
structure-constraint; ripalign with structure-constraint.]

Fig. 8. ripalign versus rip: Interaction of two specific RNA
molecules, SL1 and SmY-10 of Caenorhabditis elegans. The Sm-binding
sites (colored in red) in the RNA molecules SmY-10 and SL-1 are
5'-AAUUUUUG-3' (R[56, 62]) and 3'-GUUUUAA-5' (S[25, 31]),
respectively. The joint structure containing a single interior arc
$R_{24}S_{67}$ (top) is predicted by rip as implemented by Huang et al.
(2010). The joint structure (middle) is predicted by ripalign without any
structural constraint. The joint structure (bottom) is predicted by
ripalign under the structural constraints that 5'-AAUUUUUG-3'
(R[56, 62]) and 3'-GUUUUAA-5' (S[25, 31]) are Sm-binding sites in the
RNA molecules SmY-10 and SL-1, respectively. The target site (green
boxes) probabilities computed by ripalign are annotated explicitly if
> 10% or just by < 10%, otherwise.
stem II and Sm-binding site are colored in green and red, respectively. The
joint structures of protostomia (without insects), insects and deuterostomia 
(from top to bottom) are predicted by ripalign under the Sm-binding 
site constraint. The target site (green boxes) probabilities computed by 
ripalign are annotated explicitly if > 10% or just by < 10%, otherwise. 